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Abstract 

Modern datasets are often in the form of matrices or arrays, potentially having correlations 
along each set of data indices. For example, data involving repeated measurements of several 
variables over time may exhibit temporal correlation as well as correlation among the variables. 
A possible model for matrix-valued data is the class of matrix normal distributions, which is 
parametrized by two covariance matrices, one for each index set of the data. In this article 
we describe an extension of the matrix normal model to accommodate multidimensional data 
arrays, or tensors. We generate a class of array normal distributions by applying a group 
of multilinear transformations to an array of independent standard normal random variables. 
The covariance structures of the resulting class take the form of outer products of dimension- 
specific covariance matrices. We derive some properties of these covariance structures and the 
corresponding array normal distributions, discuss maximum likelihood and Bayesian estimation 
of covariance parameters and illustrate the model in an analysis of multivariate longitudinal 
network data. 

Some key words: Gaussian, matrix normal, multiway data, network, tensor, Tucker decomposition. 

1 Introduction 

This article provides a construction of and estimation for a class of covariance models and Gaussian 
probability distributions for array data consisting of multi-indexed values Y = {yi 1 , ■ ■ ■ , y% K : 
if. E {1, . . . ,171k}, k = 1, . . . ,K}. Such data have become common in many scientific disciplines, 
including the social and biological sciences. Researchers often gather relational data measured 
on pairs of units, where the population of units may consist of people, genes, websites or some 
other set of objects. Data on a single relational variable is often represented by a "sociomatrix" 
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Y = {yij,i E {1, . . . , m},j £ {1, . . . , to}, i ^ j}, a square matrix with an undefined diagonal, where 
Hi j represents the relationship between nodes i and j. 

Multivariate relational data include multiple relational measurements on the same node set, 
possibly gathered under different conditions or at different time points. Such data can be repre- 
sented as a multiway array. For example, in this article we will analyze data on trade of several 
commodity classes between a set of countries over several years. These data can be represented 
as a four- way array Y = {yijkt}-, where Ui,j,k,t records the volume of exports of commodity k 
from country i to country j in year t. For such data it is often of interest to identify similarities 
or correlations among data corresponding to the objects of a given index set. For example, one 
may want to identify nodes of a network that behave similarly across levels of the other factors of 
the array. For temporal datasets it may be important to describe correlations among data from 
adjacent time points. In general, it may be desirable to estimate or account for dependencies along 
each index set of the array. 

For matrix valued data, such considerations have led to the use of separable covariance estimates, 
whereby the covariance of a population of matrices is estimated as being Cov[vec(Y)] = S2 ® Si. 
In this parameterization Si and S2 represent covariances among the rows and columns of the 
matrices, respectively. Such a covariance model may provide a stable and parsimonious alternative 
to an unrestricted estimate of Cov[vec(Y)], the latter being unstable or even unavailable if the 
dimensions of the sample data matrices are large compared to the sample size. The family of 



matrix normal distributions with separable covariance matrices is studied in Dawid 1 1981 , and an 



iterative algorithm for maximum likelihood estimation is given by Dutilleul 1999 . Testing various 
hypotheses regarding the separability of the covariance structure, or the form of the component 



matrices, is considered in Lu and Zimmerman 2005 , Roy and Khattree [2005 , Mitchell et al. 2006| 
among others. Beyond the matrix-variate case, Galecki [1994 considers a separable covariance 
model for three-way arrays, but where the component matrices are assumed to have compound 
symmetry or an autoregressive structure. 

In this article we show that the class of separable covariance models for random arrays of 
arbitrary dimension can be generated with a type of multilinear transformation known as the Tucker 



product [Tucker 1964, Kolda, 2006 . Just as a zero-mean multivariate normal vector with a given 
covariance matrix can be represented as a linear transformation of a vector of independent, standard 
normal entries, in Section 2 we show that a normal array with separable covariance structure can be 
represented by a multilinear transformation of an array of independent, standard normal entries. 
As a result, the calculations involved in obtaining maximum likelihood and Bayesian estimates 
are made simple via some basic tools of multilinear algebra. In Section 3 we adapt the iterative 



algorithm of Dutilleul [1999 to the array normal model, and provide a conjugate prior distribution 
and posterior approximation algorithm for Bayesian inference. Section 4 presents an example data 
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analysis of trade volume data between pairs of 30 countries in 6 commodity types over 10 years. A 
discussion of model extensions and directions for further research follows in Section 5. 



2 Separable covariance via array-matrix multiplication 

2.1 Array notation and basic operations 

An array of order K, or K -array, is a map from the product space of K index sets to the real 
numbers. The different index sets are referred to as the modes of the array. The dimension vector 
of an array gives the number of elements in each index set. For example, for a positive integer mi, a 
vector in M. 1711 is a one- array with dimension mj. A matrix in W niXm ^ is a two-array with dimension 
(mi, 7712). A .fT-array Z with dimension (mi, . . . , nix) has elements {-z^,...,^ : i k S {1, . . . , m^}, k = 
l,...,K}. 

Array unfolding refers to the representation of an array by an array of lower order via combi- 
nations of various index sets of an array. A useful unfolding is the fc-mode matrix unfolding, or 
/c-mode matricization |De Lathauwer et al. 2000 , in which a i^-array Z is reshaped to form a 



matrix with m k rows and n,-.,y.^ rn k columns. Each column corresponds to the entries of Z in 
which the kth index i k varies from 1 to m k and the remaining indices are fixed. The assignment 
of the remaining indices {ij : j ^ k} to columns of Zm is determined by the following ordering 
on index sets: Letting i = (ii, . . . , ix) and j = (ji, . . . ,jk) be two sets of indices, we say i < j if 
ik < jk for some k and i\ < ji for all I > k. In terms of ordering the columns of the matricization, 
this means that the index corresponding to a lower-numbered mode "moves faster" than that of a 
higher-numbered mode. 

De Lathauwer et al. |2000 define an array-matrix product via the usual matrix product as 



applied to matricizations. The /c-mode product of an mi x • • • x array Z and a n x m k matrix 
A is obtained by forming the mi x • • • x m^-i xnx m^+i x • • • x nix array from the inversion of the 
k-mode matricization operation on the matrix AZ^)- The resulting array is denoted by Z x& A. 
Letting F and G be matrices of the appropriate sizes, important properties of this product include 
the following: 

• (Z Xj F) x fc G = (Z x k G) Xj F = Z Xj F x k G 

• (Z Xj F) XjG = Z xj (GF) 

• Z Xj (F + G) = Z Xj F + Z Xj G. 



|De Lathauwer et al.[ 2000 . A useful extension of the A:-mode product is the product of an array 



Z with each matrix in a list A = {Ai, . . . , A^-} in which A^ £ H> n fc xm fc ) given by 

ZxA = ZxiAiX2-- - xk A#. 
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This has been called the "Tucker operator" or "Tucker product", |Kolda 2006], named after the 
Tucker decomposition for multiway arrays |Tucker 1964, 1966| , and is used for a type of multiway 
singular value decomposition |De Lathauwer et al. 2000 . A useful calculation involving the Tucker 
operator is that if Y = Z x A, then 



(*) 



AfcZ( fe )(Ax 



<8> Ai 



Other properties of the Tucker product can be found in De Lathauwer et al. 2000| and Kolda 
[2006 



2.2 Separable covariance via the Tucker product 

Recall that the general linear group GL m of nonsingular real matrices A acts transitively on the 
space S m of positive definite m x m matrices E via the transformation ASA T . It is convenient to 
think of S m as the set of covariance matrices {Cov[Az] : A E GL m } where z is an m-variate mean- 
zero random vector with identity covariance matrix. Additionally, if z is a vector of independent 
standard normal random variables, then the distributions of y = Az as A ranges over GL m 
constitute the family of mean-zero vector-valued multivariate normal distributions, which we write 
as y ~ vnorm(0, E). 

Analogously, let A = {Ai, A 2 } E GL 

mx ,m2 = GL mi x GL m2 , and let Z be a mi x ?n 2 random 
matrix with uncorrelated mean-zero variance-one entries. The covariance structure of the random 
matrix Y = AiZA^ can be described by the mi x mi x ?n 2 x m 2 covariance array Cov[Y] for which 
the (ii, 12, 31,32] entry is equal to Cov[yi lt j 1 ,yi 2 j 2 \. It is straightforward to show that Cov[Y] = 
Si o S 2 , where Sj = AjAj , j = 1, 2 and "o" denotes the outer product. This is referred to as a 
"separable" covariance structure, in which the covariance among elements of Y can be described by 
the row covariance Si and the column covariance E 2 . Letting "trQ" be matrix trace and "(8)" the 
Kronecker product, well-known alternative ways to describe the covariance structure are as follows: 



E[YY T ] 
E[Y T Y] 
Cov[vec(Y)] 



Si x tr(E 2 ) 
E 2 x tr(Si) 



(1) 



As {Ai, A 2 } ranges over GL mi>m2 the covariance array of Y = AiZA^ ranges over the space of 
separable covariance arrays S mii?n2 = {Si o S 2 : Si E S mi , S 2 G S m2 } |Browne and Shapiro 1991 



If we additionally assume that the elements of Z are independent standard normal random variables, 
then the distributions of {Y = AiZA^ : {Ai, A 2 } E GL mijjn2 } constitute what are known as the 
mean-zero matrix normal distributions |Dawid 1981] , which we write as Y ~ mnorm(0, Si o S 2 ). 

Thinking of the matrices Y and Z as two-way arrays, the bilinear transformation Y = AiZA^ 
can alternatively be expressed using array-matrix multiplication as Y = Z Xi Ai x 2 A 2 = Z x A. 
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Extending this idea further, let Z be an m\ x • • • x uik random array with uncorrelated mean-zero 
variance-one entries, and define GL mi) ... ;mK to be the set of lists of matrices A = {Ai, . . . , Ax} 
with Aj, e GL mk . The Tucker product Z x A induces a transformation on the covariance structure 
of Z which shares many features of the analogous bilinear transformation for matrices: 

Proposition 2.1 Let Y = Z x A, where Z and A are as above, and let E^ = AfcAjT. Then 

1. Cov[Y] = Si o ••• o E x , 

2. Cov[vec(Y)] = <g> ■ • • ® £i, 

3. E[Y (fc) Yf fc) ] = S fc xn^ fc tr(^)- 

The following result highlights the relationship between array-matrix multiplication and sepa- 
rable covariance structure: 

Proposition 2.2 //Cov[Y] = Si o • • • o and X = Y x^ G, then 

Cov[X] = Si o • • • o E fe _i o (G£ fc G T ) o £ fc+1 o • • ■ o E*. 

This indicates that the class of separable covariance arrays can be obtained by repeated single- 
mode array-matrix multiplications starting with an array Z of uncorrelated entries, i.e. for which 
Cov[Z] = I mi o • • • o l mK . The class of separable covariance arrays is therefore closed under this 
group of transformations. 



3 Covariance estimation with array normal distributions 

3.1 Construction of an array normal class of distributions 

Normal probability distributions are useful statistical modeling tools that can represent mean and 
covariance structure. A family of normal distributions for random arrays with separable covariance 
structure can be generated as in the vector and matrix cases: Let Z be an array of independent 
standard normal entries, and let Y = M + Z x A with M G R m ix- xm K and A G GL mi) ... jmjf . 
We say that Y has an array normal distribution, denoted Y ~ anorm(M, Si o • • • o Sjf), where 
Sfc = A fc Aj[. 

Proposition 3.1 The probability density ofY is given by 

p(Y\M, Si, ... , £*) = (2vr)- m / 2 (j[ |E fc |-™ /C 2 "*^ x exp(-||(Y - M) x S^/ 2 || 2 /2), 
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where m = FL mj,, S 1//2 = {A 1 1 , . . . , A^ 1 } and the array norm ||Z|| 2 = (Z, Z) is derived from 
the inner product (X, Y) = ^ • • • J2i K Xi u ...,i K yi u ...,i K - 

Also important for statistical modeling is the idea of replication. If Yi, . . . , Y n ~ anorm(M, Sio 
• • • o Jjk), then the array mi x • • • x vhk x ti array formed by stacking the Yj's together also has 
an array normal distribution: If Yi, . . . , Y n ~ anorm(M, Si o • • • o Sjf), then 

Y = (Yi, . . . ,Y mK ) ~ anorm(M o l n , Si o • • • o S# o I„). 

This can be shown by computing the joint density of Yi, . . . , Y n and comparing it to the array 
normal density. 

An important feature of the multivariate normal distribution is that it provides a conditional 
model of one set of variables given another. Recall, if y ~ vnorm(/i, S) then the conditional 
distribution of one subset of elements y b of y given another y a is vnorm(/z fe | a , S^| a ), where 

Vb\a = P[b] + S [6,fl]( S K«])~ 1 (y,i - /*[o]) 
^b\a = £[6,b] - ^[b,o](^[a,a]) _1S [o,6]> 

with Sy, for example, being the matrix made up of the entries in the rows of S corresponding 
to b and columns corresponding to a. 

A similar result holds for the array normal distribution: Let a and b be non-overlapping subsets 
of {1, . . . , mi}. Let Yfc = {yi lt ... t i K : i\ E b} and Y a = {yi lr ..,i K ■ h G a} be arrays of dimension 
mib x m-2 x • • • x rriK an d mi a x mi x • • • x mjc, where mi a and mi& are the lengths of a and b 
respectively. The arrays Y a and Y^ are made up of non-overlapping "slices" of the array Y along 
the first mode. 

Proposition 3.2 Let Y ~ anorm(M, Si o • • • o S/<). The conditional distribution of Y^ given Y a 
is array normal with mean and covariance ^iM a o S2 o • • • o S^ ; where 

M b]a = M b + (Y a -M a ) Xl (Si^^Si^j)- 1 ) 

^b\a = ^l[b,b] — ^l[6,a](^l[a,a]) 1 ^'l[a,6]- 

Since the conditional distribution is also in the array normal class, successive applications of Propo- 
sition |3.2| can be used to obtain the conditional distribution of any subset of the elements of Y of 
the form {y^ i K : ik £ conditional upon the other elements of the array. 

3.2 Estimation for the array normal model 

Maximum likelihood estimation: Let Yi , . . . , Y n ~ anorm(M, Si o • • ■ o S^), or equivalently, 
Y ~ anorm(M o 1, Si o • • • o S^- o I n ). For any value of S = {Si, . . . , S^-}, the value of M that 
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maximizes p(Y|M, S) is the value that minimizes the residual mean squared error: 
-V||(Y,-M)xE- 1 / 2 || 2 = -V(Y i -M,(Y i -M)xE- 1 ) 

n *s n. ^— ' 



n « — ' n 

i=l i=l 



= (M,Mxr 1 )-2{M,Yxr 1 ) + c 1 (Y,S) 
= (M- Y,(M- Y) x E- 1 ) + c 2 (Y,S) 
= ||(M- Y) x E- 1 / 2 || 2 + c 2 (Y,E). 

This is uniquely minimized in M by Y = ^ Yj/n, and so M = Y is the MLE of M. The MLE of S 
does not have a closed form expression. However, it is possible to maximize p(Y|M, S) in S^, given 
values of the other covariance matrices. Letting E = Y — M o l n , the likelihood as a function of Sjt 
can be expressed as p(Y|M,£) oc |£ fe |-W(2m fe ) exp{-||E x {S~ 1/2 , . . . , S^ 1/2 , I n }| | 2 /2}. Since 
for any array Z and mode k we have ||Z|| 2 = ||Z( fc )|| 2 = tr(Z^Z^), the norm in the likelihood 
can be written as 

||ExS~ 1/2 || 2 = ||E x k ^l 1,2 \\ 2 

= ^(S- 1/2 E (fc) Ef fc) 5] fc 1/2 ) 
= tr(S~ E( fc) E (fc) ), 

where E = E x {Sj ^ 2 , . . . , S^,^ 2 , 1^., S fcH ^ 2 , . . . , S x 1//2 , I n } is the residual array standardized along 
each dimension except k. Writing S k = E^E^, we have 

p(Y|M,S) oc |S fe |- nni /( 2m *)etr{-E- 1 S A /2} 

as a function of and so if is of full rank then the unique maximizer in S& is given by 
Sfc = S k /n k , where = nm/mk = n x Ylj^ k mj is the number of columns of Y^), i.e. the 
"sample size" for the feth mode. This suggests the following iterative algorithm for obtaining the 
MLE of S: Letting E = Y - Y o l n and given an initial value of S, for each k £ {1, . . . , K} 

1. compute E = E x {S~ 1/2 , . . . , S^f, I fe , S~+{ 2 , . . . , S~ 1/2 , I n } and S k = B {k) t^ k) ; 

2. set Sfc = S k /n k , where n k = n x \\j^ k rrij. 

Each iteration increases the likelihood, and so the procedure can be seen as a type of block coor- 



dinate descent algorithm Tseng, 2001 . For the matrix normal case, this algorithm was proposed 
by Dutilleul |1999 and is sometimes called the "flip-flop" algorithm. Note that the scales of 
{Si, . . . , 'Sk} are not separately identifiable from the likelihood: Replacing Si and S 2 with cSi 
and S 2 /c yield the same probability distribution for Y, and so the scales of the MLEs will depend 
on the initial values. 
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Bayesian estimation: Estimation of high-dimensional parameters often benefits from a com- 
plexity penalty, e.g. a penalty on the magnitude of the parameters. Such penalties can often be 
expressed as prior distributions, and so penalized likelihood estimation can be done in the context 
of Bayesian inference. With this in mind, we consider semiconjugate prior distributions for the 
array normal model, and their associated posterior distributions. 

A conjugate prior distribution for the for the multivariate normal model y± . . . y n ~ vnorm(/x, S) 
is given by S) = p(/i|S)p(S) , where p(S) is an inverse- Wishart density and p(/x|S) is multi- 
variate normal density with prior mean /x and prior (conditional) covariance S/ko- The parameter 
no can be thought of as a "prior sample size," as the the prior covariance S/zto f° r t 1 is the same as 
that of a sample average based on kq observations. Under this prior distribution, the conditional 
distribution of /x given the data and S is multivariate normal, and the condition distribution of S 
given the data is inverse- Wishart. An analogous result holds for the array normal model: If 

M|S ~ anorm(M ,£i o • • • o S^/k ) 
S& ~ inverse- Wishart ( S^ 1 , ^ofe) 

and Si, ... , Y^k are independent, then straightforward calculations show that 

M|Y, S ~ anorm([K M + nY]/[rc + n],Si o • • • o T, k /[k + n}) 
S fe |Y, S_ fc ~ inverse-Wishart([Sofc + S fc + R( fe )R( fe) ] _1 ,i/ fc + n A: ), 

where and are as in the coordinate descent algorithm for maximum likelihood estimation, 

and R = 7^( Y " M °) >< <^ 1/2 , - , I, <( 2 , ■ ■ ■ , V K 1/2 }. 

As noted above, the scales of {Si, . . . , S^} are not separately identifiable from the likelihood. 
This makes the prior and posterior distributions of the scales of the S^'s difficult to specify or 
interpret. As a remedy, we consider reparameterizing the prior distribution for Si,...,S^ to 
include a parameter representing the total variance in the data. Parameterizing Sofc = tSq/c for 
each k G {1, . . . , K}, the prior expected total variation of Yj, tr(Cov[vec(Yj)]), is 

E[tr(Cov[vec(Y)])] = E[tr(S^ ® ■ ■ ■ ® Si)] 

K 

= E[JJtr(S fc )] 

k=l 

K K 

= Yl tr(E[S fc ]) = 7^ II tr(S ofe )/(i/ofc - m k - 1). 
k=l k=l 

A simple default choice for Sofc and VQk would be So/t = I mk /mk and vq^ = rrik + 2, for which 
E[tr(Sfe)] = 7 and the expected value for the total variation is j K . Given prior expectations about 
the total variance, the value of 7 could be set accordingly. Alternatively, a prior distribution could 
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be placed on 7: If 7 ~ gamma(o, b) with prior mean a/b, then conditional on Si, ... , S/c, we have 

7|Si,...Ej<- ~ gamma (a + ^ v 0k m k /2, b + ^ tr(£ fc l £ fc)/2j . 

The full conditional distributions of {M, Si, . . . , E^ , 7} can be used to implement a Gibbs 
sampler, in which each parameter is sampled in turn from its full conditional distribution, given 
the current values of the other parameters. This algorithm generates a Markov chain having a 
stationary density equal to p(M, Si, ... , S^, 7I Y), samples from which can be used to approximate 
posterior quantities of interest. Such an algorithm is implemented in the data analysis example in 
the next section. 



4 Example: International trade 

The United Nations gathers yearly trade data between countries of the world and disseminates this 
information at the UN Comtrade website http://comtrade.un.org/. In this section we analyze 



trade among pairs of countries over several years and in several different commodity categories. 
Specifically, the data take the form of a four-mode array Y = {yij t k,t} where 

• i G {1, . . . , 30 = m} indexes the exporting nation; 

• j € {1, . . . , 30 = m} indexes the exporting nation; 

• k £ {1, . . . ,6 = p} indexes the commodity type; 

• t G {1, . . . ,10 = n} indexes the year. 

The thirty countries were selected to make the data as complete as possible, resulting in a set of 
mostly large or developed countries with high gross domestic products and trade volume. The six 
commodity types include (1) chemicals, (2) inedible crude materials not including fuel, (3) food 
and live animals, (4) machinery and transport equipment, (5) textiles and (6) manufactured goods. 
The years represented in the dataset include 1996 through 2005. As trade between countries is 
relatively stable across years, we analyze the yearly change in log trade values, measured in 2000 
US dollars. For example, 2/1,2,1,1 is the log-dollar increase in the value of chemicals exported from 
Australia to Austria from 1995 to 1996. We note that exports of a country to itself are not defined, 
so Vi,i,j,t is "not available" and can be treated as missing at random. 

We model these data as Y = M o l n + E , where M is an m x m x p array of means specific to 
exporter-importer-commodity combinations, and Eisanmxmxpxn array of residuals. Of interest 
here is how the deviations E of the data from the mean may be correlated across exporters, importers 
and commodities. One possible model for this residual variation would be to treat the p-dimensional 
residual vectors corresponding to each of the m x (m — 1) X n = 8700 exporter-importer-year 
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combinations as independent samples from a p-variate multivariate normal distribution. However, 
to accommodate potential temporal correlation (beyond that already accounted for by taking Y 
to be the lagged log trade values), the p x n residual matrices corresponding to each of the m x 
(m — 1) = 870 exporter-importer pairs could be modeled as independent samples from a matrix 
normal distribution, with two separate covariance matrices representing commodity and temporal 
correlation. This latter model can be described by an array normal model as 

Y ~ anorm(M o l n , I m o I m o S3 o S4), (2) 

where S3 and S4 describe covariance among commodities and time points, respectively. However, 
it is natural to consider the possibility that there will be correlation of residuals attributable to 
exporters and importers. For example, countries with similar economies may exhibit correlations 
in their trade patterns. With this in mind, we will also fit the following model: 

Y ~ anorm(Mo l n ,Si o S 2 o S 3 o S 4 ). (3) 

We consider Bayesian analysis for both of these models based on the prior distributions described 
at the end of the last section. The prior distribution for each S^ matrix being estimated is given 
by Sfc ~ inverse- Wishart(mfcI mfe /7, m^ + 2), with the hyperparameter 7 set so that 7^ = ||Y — 
Y o l n || 2 . As described in the previous section, this weakly centers the total variation of Y under 
the model around the the empirically observed value, similar to an empirical Bayes approach or 



unit information prior distribution [Kass and Wasserman 1995 . The prior distribution for M 
conditional on S is M ~ anorm(0, Si o S2 o S3), where Si = S2 = I m for model [2] 

Posterior distributions of parameters for both model [2] and [3] can be obtained using the results of 
the previous section with minor modifications. Under both models the mxmxp arrays Yi , . . . , Y n 
corresponding to the n = 10 time-points are not independent, but correlated according to S4. The 
full conditional distribution of M is still given by an array normal distribution, but the mean and 
variance are now as follows: 

E[M|Y,S] = KoM ° + 

Var[M|Y,S] = Si o S 2 o S 3 o S 4 /(k + ^ cf) 

where Yi, . . . , Y n are the mxmxp arrays obtained from the first three modes of the transformed 

~ — 1/2 — 1/2 

array Y = Y x 4 S 4 , and c\ , . . . , c n are the elements of the vector c = S 4 1 . Additionally, 
the time dependence makes it difficult to integrate p(M, S|Y) as was possible in the independent 
case. As a result, we use a Gibbs sampler that proceeds by sampling S& from its full conditional 
distribution p(Sfe| Y, M, S_&) as opposed to the p(S^|Y, S_^) as before. This full conditional 
distribution is still a member of the inverse- Wishart family: 

S fc |Y,M, S_ fc ~ inverse- Wishart ( [S fc + E (A .)E^. ) +R (Jt) R^] - Vofe + ^ x [1 + 1/n]), 
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where E(i), for example, is the £>mode matricization of (Y — Mo l n ) x {I m , S, 



-1/2 



-1/2 v,-l/2 



} 



and R(i) is the /c-mode matricization of (M — Mq) x {I, S 2 



-1/2 v -l/2 



}• 



Separate Markov chains for each of the two models were generated using 205,000 iterations of 
the Gibbs sampler discussed above. The first 5,000 iterations were dropped from each chain to 
allow for convergence to the stationary distribution, and parameter values were saved every 40th 
iteration thereafter, resulting in 5,000 parameter values with which to approximate the posterior 
distributions. Mixing of the Markov chain was assessed by computing the "effective sample size" , 
or equivalent number of independent simulations, of several summary parameters. For the full 
model, effective sample sizes of 70 = tr(Si ® • • • ® X4), 71 = tr(Si), . . . , 74 = tr(S4) were computed 
to be 2,545, 904, 960, 548 and 1,734. Note that 71, ... ,74 are not separately identifiable from the 
data, resulting in poorer mixing than 70, which is identifiable. For the reduced model, the effective 
sample sizes of 70, 73 and 74 were 4,281, 1,194 and 1,136. 
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Figure 1: Posterior predictive distributions for summary statistics. The gray density represents is 
under the reduced model, the black under the full. The vertical dashed line is the observed value 
of the statistic. 



The fits of the two models can be compared using posterior predictive evaluations Rubin, 1984 



To evaluate the fit of a model, the observed value of a summary statistic t(Y) can be compared 
to values t(Y) for which Y is simulated from the posterior predictive distribution. A discrepancy 
between i(Y) and the distribution of t(Y) indicates that the model is not capturing the aspect of 
the data represented by £(•). For illustration, we use such checks here to evaluate evidence that Si 
and S2 are not equal to the identity, or equivalently, that model 1 exhibits lack of fit as compared 
to model 2. To obtain a summary statistic evaluating evidence of a non-identity covariance matrix 
for Si, we first subtract the sample mean from the data array to obtain E = Y — Y, and then 
compute Si = (E^E^). The mxm matrix Si is a sample measure of covariance among exporting 
countries. We then obtain a scaled version Si = Si/tr(Si), and compare it to a scaled version of 
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Figure 2: Estimates of correlation matrices corresponding to Si, S2 and S3. The first panel in 
each row plots the eigenvalues of each correlation matrix, the second plots the first two eigenvectors. 

the identity matrix: 

t\(Y) = log I Si I — log \I/m\ = log I Si I + mlogm. 

Note that the minimum value of this statistic occurs when Si = I/m, and so in some sense it 
provides a simple scalar measure of how the sample covariance among exporters differs from a 
scaled identity matrix. Similarly, we construct t% ( Y) and £3 ( Y) measuring sample covariance along 
the second and third modes of the data array. We include £3 to contrast with t\ and £2, as both 
the full and reduced models include covariance parameters for the third dimension of the array. 

Figure [l] plots posterior predictive densities for ii(Y), i2(Y) and ts(Y) under both the full and 
reduced models, and compares these densities to the observed values of the statistics. The reduced 
model exhibits substantial lack of fit in terms of its inability to predict datasets that resemble the 
observed in terms of t± and £2- in other words, a model that assumes i.i.d. structure along the first 
two modes of the array does not fit the data. In terms of covariance among commodities along the 
third mode, neither model exhibits substantial lack of fit as measured by £3. 
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Figure [2] describes posterior mean estimates of the correlation matrices corresponding to Si, 
S2 and S3. The two panels in each column plot the eigenvalues and the first two eigenvectors each 
of the three correlation matrices. The eigenvalues for all three suggest the possibility of modeling 
the covariances matrices with factor analytic structure, i.e. letting S& = AA T + diag(fef , . . . , 6jL ), 
where A is an x r matrix with r < mp.. This possibility is described further in the Discussion. 
The second row of Figure [2] describes correlations among exporters, importers and commodities. 
The first two plots show that much of the correlation among exporters and among importers is 
related to geography, with countries having similar eigenvector values typically being near each 
other geographically. The third plot in the row indicates correlation among commodities of a 
similar type: Moving up and to the right from "crude materials," the commodities are essentially 
in order of the extent to which they are finished goods. 



5 Discussion 



This article has proposed a class of array normal distributions with separable covariance structure 
for the modeling of array data, particularly multivariate relational arrays. It has been shown that 
the class of separable covariance models can be generated by a type of array-matrix multiplication 
known as the Tucker product. Maximum likelihood estimation and Bayesian inference in Gaussian 
versions of such models are feasible using relatively simple properties of array-matrix multiplication. 
These types of models can be useful for describing covariance within the index sets of an array 
dataset. In an example involving longitudinal trade data, we have shown that a full array normal 
model provides a better fit than a matrix normal model, which includes a covariance matrix for 
only two of the four modes of the data array. 

One open area of research is finding conditions for the existence and uniqueness of the MLE 
for the array normal model. In fact, such conditions for even the simple matrix normal case are 



not fully established. Each step of the iterative MLE algorithm of Dutilleul 1999 is well defined 
as long as 11 > max{mi/m2,m2/"ii} + 1, suggesting that this condition might be sufficient to 
provide existence. Unfortunately, it is easy to find examples where this condition is met but the 
likelihood is unbounded, and no MLE exists. In contrast, Srivastava et al. [2008 show the MLE 
exists and is essentially unique if n > max{mi,ra2}, a much stronger requirement. However, they 
do not show that this condition is necessary. Unpublished results of this author suggest that this 
latter condition can be relaxed, and that the existence of the MLE depends on the minimal possible 
rank of Y17=i YjUfcUjYj" for k < minjmi, 771,2} , where Ufc is an 771-2 X k matrix with orthonormal 
columns. However, such conditions are somewhat opaque, and it is not clear that they could easily 
be generalized to the array normal case. 

A potentially useful model variation would be to consider imposing simplifying structure on the 
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component matrices. For example, a normal factor analysis model for a random vector y £ W posits 
that y = /x + Bz + De where z £ M r , r < p and e 6 M p are uncorrelated standard normal vectors 
and D is a diagonal matrix. The resulting covariance matrix is given by Cov[y] = BB T + D 2 , 
in which the "interesting" part BB T is of rank r. The natural extension to random arrays is 
Y; = M + Z x {Bi,...,Bjc} + E x {Di,...,D^} where Z £ Rnx-xr^ and E e RP ix-x PK are 
uncorrelated standard normal arrays. This induces the covariance matrix Cov[vec(Y)] = (B^B^)® 
• • • (BiBi) r + Y) 2 K ■ ■ ■ Df . This is essentially the model-based analogue of the higher-order 



SVD of De Lathauwer et al. 2000 , in the same way that the usual factor analysis model for vector- 
valued data is analogous to the matrix SVD. Alternatively, in some cases it may be desirable to fit 
a factor-analytic structure for the covariances of some modes of the array while estimating others 
as unstructured. This can be achieved with a model of the following form: 

Y = M + Z x {S; /2 ,..,S; /2 ,B Wl ...,Brf + Ex {s} /2 , . . . , sf 2 , D fe+1 , . . . , D x } 

where Z £ MPi x --Pfc xr "fe+i x "' r if f anc j E £ W>^ x -- x pk . The resulting covariance for Y is given by 

Cov[vec(Y)] = (B^B^ + T> 2 K ) ® • ■ • ® (B fc+ iB fc+ i + D| +1 ) T ® E* ® • • • ® Si, 

which is separable, and so is in the array normal class. Such a factor model may be useful if some 
modes of the array have very high dimensions, and rank- reduced estimates of the corresponding 
covariance matrices are desired. 

An additional model extension would be to accommodate non-normal continuous or discrete 
array data, for example, dynamic binary networks. This can be done by embedding the array 
normal model within a generalized linear model, or within an ordered probit model for ordinal 
response data. For example, if Y is a three-way binary array, an array normal probit model would 
posit a latent array Z ~ anorm(M, Si o S2 o S3) which determines Y via yij t k = ^<o,oo)( z i,j,k)- 



Computer code and data for the example in Section 5 is available at my website: http://www 



stat . Washington . edu/~hof f 



Appendix 



Proof of Proposition 3.1. Let Y = Z x A where the elements of Z are uncorrelated, have 
expectation zero and variance one. Using the fact that Ym = AiZ^~B T where B = (A#-®- ■ -® A2) 



Kolda 2006, Proposition 4.3], we have 
vec(Y) = vec(Y(x)) = 



vec(A 1 Z (1) B T ) 
(B ® Ai)vec(Z (1) ) 



(B ® Ai)vec(Z). 
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The covariance of vec(Y) is then 

E[vec(Y)vec(Y) T ] = (B ® Ai)E[vec(Z)vec(Z) J ](B <8 Ai 
= (Ax <8> • • • <8> Ai)I(Ak® • •• ® A X ) T 
= (Aa-A^-) ® • • • ® (AiAf) 
= Sa- ® • ■ ■ (8) Si, 

where = A^A^ . This proves the second statement in the proposition. The first statement 
follows from how the "vec" operation is applied to arrays. For the third statement, consider the 
calculation of E[Y(i)YjTj, again using the fact that Y(i) = AiZ(!)B T : 

E[Y (1) Yf 1} ] = AiE[Z (1) B T BZf 1) ]A?' 

= AiE[XX r ]Af, (4) 

where X = Z^B^. Because the elements of Z are all independent, mean zero and variance one, 
the rows of X are independent with mean zero and variance BB T . Thus E[XX T ] = tr(BB T )I. 
Combining this with Q gives 

E[Y (1) Yfn] = AxAftr(BB T ) 



= AiAftr([AAr<g>---<8> A 2 ] [A^ <8 • • • <8 Af] 
= Eitr(Ejf ® ■ • • ® Ei) 

K 

= EiJJtr(E fe ). 

k=2 

Calculation of EfY^Yj^J for other values of k is similar. □ 

Proof of Proposition 3.2. We calculate E[vec(X)vec(X) T ] for the case that X = Y Xi G: 

vec(X) = vec(X(!)) = vec(GY( 1 )) 

= vec(GY (1) I) 

= (I <8 G)vec(Y) , so 

E[vec(X)vec(X) T ] = (I ® G)E[vec(Y)vec(Y) T ](I ® G) T 

= (I®G)(Ea-® •••(8) Ei)(I® G T ) 

= [(Sa-®---(8E 2 )®(GE 1 )][I0G t ] 

= (S K ®"-®S 2 )®(GSiG T ). 

Calculation for the covariance of X = Y x& G for other values of k proceeds analogously. □ 

Proof of Proposition 4.1 The density can be obtained as a re-expression of the density 
of e = vec(E) = vec(Y — M), which has a multivariate normal distribution with mean zero and 
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covariance S^- (8 • • • <8 Si. The re-expression is obtained using the following identities, 

||(y- m) x sr 1 / 2 !! 2 = (E,Exr') 

= vec(E) T vec(E x 5T 1 ) 

= e T (S^ ® • • • <8 Ei) -1 e , and 

K 

isjf®---®Eii = ni Sfc i nfc ' 

k=l 

where = Ylj.j^^mj is the number of columns of Y^, i.e. the "sample size" for □ 

Proof of Proposition 4.3. We first obtain the full conditional distributions for the matrix 
normal case. Let Y ~ anorm(0,E <S> O) and S _1 = ^f. Let (a, b) form a partition of the row 
indices of Y, and assume the rows of Y are ordered according to this partition. The quadratic term 
in the exponent of the density can then be written as 

trCn-^^Y) = tr (n-HY T a Yl) ( ^ ) ( y[ )) 

= trCO-^^ooYo) + 2tr(n- 1 Y^* 6o Y ) + tr^Y^Yi,). 

As a function of Y&, this is equal to a constant plus the quadratic term of the matrix normal density 
with row and column covariance matrices of SfrT, and ft , and a mean of — flT, f2{, a Y a . Standard 
results on inverses of partitioned matrices give the row variance as ty^ 1 = E&6 — E6 a (E aa ) -1 E a 6 = 
E fe | a and the mean as — fl^ b f2& a Y a = Ej > | (E 00 )""" 1 Y&. To obtain the result for the array case, 
note that if Y ~ anorm(0, o • • ■ o E^-) then the distribution of Yn) is matrix normal with 
row covariance Si and column covariance S^ (8) • • • ® Sa- The conditional distribution can then 
be obtained by applying the result for the matrix normal case to Y^ with S = Si and ft = 
Sjf ® • • • ® S 2 . □ 
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